SUMMARY

Perceptions of priorities are influenced, in part, by how frequently we hear about an issue and by the context in which we hear it. Can we detect systematic differences in language that reflect political priorities and biases? Here, using standard NLP (Natural Language Processing) techniques, I explore this question by looking for differences in the texts of recent Republican and Democratic presidential debates. Key findings are:
1. “Wordcloud” visualization reveals some stylistic differences between candidates but offers little clarity on specific positions.
2. Word frequencies of selected “key words” suggest differences in position. A z-statistic can be used to highlight significant differences between candidates.
3. Initial results for bigram tokenization reveal some differences in key-word context.

DATA SOURCES AND METHODS

The texts of the presidential debates were downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files. From that point on, all processing was done in R using the {tm} package and associated libraries.
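
To separate the candidates, each transcript has to be split by speaker. The following is a minimal sketch of one way to do that in base R, not the code used for this analysis. It assumes every turn starts with the speaker's surname in capitals followed by a colon (for example "TRUMP: ...") and that unlabeled lines continue the previous turn. The helper name `speaker_text` and the toy transcript are hypothetical.

```r
# Hypothetical helper: collect everything one speaker said in a transcript.
# Assumption: a turn starts with "SURNAME:" in capitals; lines without a
# label belong to whoever spoke last.
speaker_text <- function(lines, speaker) {
  # Which lines open a new turn, and who is speaking on them
  has_label <- grepl("^[[:upper:]]+:", lines)
  label <- ifelse(has_label, sub("^([[:upper:]]+):.*$", "\\1", lines), NA)
  # Carry the most recent label forward over continuation lines
  idx <- cumsum(has_label)
  current <- c(NA, label[has_label])[idx + 1]
  # Keep this speaker's lines and strip the leading label
  keep <- !is.na(current) & current == speaker
  sub("^[[:upper:]]+:\\s*", "", lines[keep])
}

# Toy transcript (invented lines, not taken from the debates)
toy <- c("MODERATOR: Opening question.",
         "SMITH: We need a plan.",
         "It starts with jobs.",
         "JONES: I disagree.",
         "SMITH: Then show me yours.")
smith <- speaker_text(toy, "SMITH")
print(smith)
```

Surnames containing spaces or punctuation would need a wider pattern than the one assumed here.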

CANDIDATE WORD-CLOUDS

A quick and visually appealing method to compare texts is word-frequency analysis using the {wordcloud} package in R. Not surprisingly, word choices vary between candidates. However, there are also some striking similarities.

Let’s first compare the word clouds of the candidates in pairs.

TRUMP V. SANDERS

Bernie’s word cloud is larger than Donald’s because he spoke more words in total. (There were three major candidates at the Democratic debate and ten at the Republican debate.) What I find most surprising is the similarity of the clouds; words like “people”, “country”, and “going” are common to both. Despite strong differences in policy, the word clouds reveal little about them.

c_wordcloud(trump_all)

c_wordcloud(sanders_all)

HILLARY V. CARLY

In this case the word clouds couldn’t be more different. Hillary’s emphasis on “think” and “people” differs remarkably from Carly’s emphasis on “government”.

c_wordcloud(clinton_all)

c_wordcloud(fiorina_all)

CRUZ V. HUCKABEE

Ted Cruz’s word cloud emphasizes technicalities, like “taxes” and “washington”, while that of Mike Huckabee, a former minister, seems a mix of Mr. Trump’s and Ms. Fiorina’s.

c_wordcloud(cruz_all)

c_wordcloud(huckabee_all)

STAYING ON MESSAGE: COMPARING DEBATES

We can also split the text by specific debate. Since the debates cover different topics and questions, one might expect to see this reflected in the text of the separate dialogues. What’s surprising here is how similar each candidate’s language is from one debate to the next.

c_wordcloud(candidate_text_tc("TRUMP", r_oct))

c_wordcloud(candidate_text_tc("TRUMP", r_nov))

c_wordcloud(candidate_text_tc("SANDERS", d_oct))

c_wordcloud(candidate_text_tc("SANDERS", d_nov))

WORD FREQUENCY

We can check word frequency directly by tokenizing the text and counting single words.

Here are the five most frequent words used by each candidate, combined into one table.

word        trump  sanders  clinton  fiorina    sum
think           9       55       90        9    163
know           23       26       56       19    124
well            9       31       56        8    104
people         33       85       53       10    181
going          44       44       45       10    143
government      0        7        6       40     53
every           4       15        9       26     54
need            5       33       36       18     92
will           23       25       32       17     97
country        34       70       25        1    130
SUM          1685     4314     4618     1580  12197

Word counts differ widely. For instance, Carly Fiorina said “government” a total of forty times in her two debates, while Donald Trump didn’t say it at all.
Carly Fiorina spoke 1580 words in total, with a vocabulary of 702 distinct words. By comparison, Bernie Sanders spoke 4314 words in total, with a vocabulary of 1375 distinct words.
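
Counts like these can be reproduced with a few lines of base R. The sketch below is illustrative only and is not the {tm} pipeline used here: the `tokenize` helper and the toy sentences are hypothetical, and stop-word removal is left out. Stripping punctuation is an assumption, though it is consistent with tokens such as “theyre” that appear later in this analysis.

```r
# Hypothetical tokenizer: lower-case the text, drop everything except
# letters and spaces, then split on runs of spaces.
tokenize <- function(text) {
  clean <- gsub("[^a-z ]", "", tolower(paste(text, collapse = " ")))
  tokens <- strsplit(clean, " +")[[1]]
  tokens[tokens != ""]
}

# Toy input (invented sentences, not taken from the debates)
toy_text <- c("The people know.", "People know the country; they're going.")
tokens <- tokenize(toy_text)

# Frequency of each distinct word, most frequent first
counts <- sort(table(tokens), decreasing = TRUE)

total_words <- length(tokens)   # all words spoken
vocabulary  <- length(counts)   # distinct words only
print(counts)
```

The two summary numbers correspond to the “total words” and “vocabulary” figures quoted for each candidate.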

NORMALIZED WORD FREQUENCIES

The table above suggests there may be information in comparing how often one candidate uses a word with how often another does. Here is a graph of the “top” words used by all candidates. Because the candidates spoke very different numbers of words, we need to normalize the word counts first: \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\) with count \(W_{i}\), and the sum runs over all \(N\) distinct words used by the candidate.
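
As a worked example of the normalization, here are the counts for “government” and the per-candidate totals from the word-frequency table, divided out in R. The variable names are illustrative.

```r
# Count of "government" per candidate, from the word-frequency table
counts <- c(trump = 0, sanders = 7, clinton = 6, fiorina = 40)
# Total words spoken per candidate (the SUM row of the same table)
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580)

# Normalized frequency: share of each candidate's words that are "government"
nu <- counts / totals
print(round(100 * nu, 2))   # as percentages
```

For Carly Fiorina this gives 40 / 1580, or roughly 2.5 percent of her words.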

In the graph below the \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.

This is much more informative. For instance, “government” accounts for more than two percent of Carly Fiorina’s words, whereas Donald Trump doesn’t use the word at all. Notice that both Bernie Sanders and Donald Trump mention the word “wall” significantly more than their competitors, while Bernie Sanders alone mentions the word “street” with comparably high frequency. We’ll revisit this below.
Many of the most frequent words convey little information about candidate positions. As with the word cloud analysis, they mostly convey style.

COEFFICIENT OF VARIATION

To highlight differences between candidates we can look at the standard deviation of the word frequencies normalized to the mean value, or the Coefficient of Variation.

Words with the highest coefficient of variation, \(c_v = \sigma/\mu\), where \(\sigma\) is the standard deviation of a word’s normalized frequency across candidates and \(\mu\) is its mean value, stand out clearly. These include “government”, “street” and others identified above.
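
The calculation can be sketched with the counts from the word-frequency table. This is an illustration rather than the original code, and it makes two assumptions: only the four candidates in that table are included, and \(\sigma\) is the sample standard deviation returned by R's `sd()`.

```r
# Counts from the word-frequency table (rows: words, columns: candidates)
counts <- rbind(
  think      = c(9, 55, 90, 9),
  know       = c(23, 26, 56, 19),
  well       = c(9, 31, 56, 8),
  people     = c(33, 85, 53, 10),
  going      = c(44, 44, 45, 10),
  government = c(0, 7, 6, 40),
  every      = c(4, 15, 9, 26),
  need       = c(5, 33, 36, 18),
  will       = c(23, 25, 32, 17),
  country    = c(34, 70, 25, 1)
)
colnames(counts) <- c("trump", "sanders", "clinton", "fiorina")
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580)

# Normalize each column by that candidate's total word count
nu <- sweep(counts, 2, totals, "/")

# Coefficient of variation per word: sd across candidates over the mean
cv <- apply(nu, 1, sd) / rowMeans(nu)
print(round(sort(cv, decreasing = TRUE), 2))
```

Among these ten words, “government” has the largest coefficient of variation, in line with the observation above.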

KEYWORDS

One way to address the problem of “filler” words is to select specific “key words” relevant to the topics of interest. The list below combines some “hand selected” words as well as those with a high coefficient of variation.

key_words = c("tax", "government", "climate", "class", "wall", "street", "terror", "economy", "immigrant", "america", "veteran", "drug", "health", "gun", "education", "bankruptcy", "money", "women", "war", "rights", "abortion", "violence", "theyre", "going")
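
Several of these key words read like stems (“immigrant”, “terror”), so one plausible way to count them is by prefix, so that “taxes” also counts toward “tax”. Whether the original analysis matched on prefixes or on exact words is not stated, so the sketch below is an assumption. The token vector is invented for illustration.

```r
# A short subset of key words and some toy tokens (invented)
key_words_demo <- c("tax", "immigrant", "wall")
tokens <- c("taxes", "tax", "wall", "street", "immigrants", "walls", "taxpayer")

# Count tokens that begin with each key word (prefix match, an assumption)
keyword_counts <- sapply(key_words_demo,
                         function(k) sum(startsWith(tokens, k)))
print(keyword_counts)
```

Prefix matching also picks up words such as “taxpayer”; matching with `tokens == k` instead would count exact words only.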

An apparent problem is that many of the words of interest have fairly low frequencies. To better distinguish significant differences, we can calculate a simple \(z\) statistic for each word, \(z = (\nu - \mu)/\sigma\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of that word’s normalized frequencies across candidates.
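
A minimal sketch of this standardization follows, using three rows of the word-frequency table. It assumes the mean and standard deviation are taken across the four candidates in that table and that the sample standard deviation is used. The original analysis may cover more candidates, so these values need not match the figures quoted in this section.

```r
# Three words from the word-frequency table (rows: words, columns: candidates)
counts <- rbind(
  government = c(0, 7, 6, 40),
  people     = c(33, 85, 53, 10),
  country    = c(34, 70, 25, 1)
)
colnames(counts) <- c("trump", "sanders", "clinton", "fiorina")
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580)

# Normalized frequencies, one column per candidate
nu <- sweep(counts, 2, totals, "/")

# z score per word: subtract the row mean, divide by the row sd.
# A vector of length nrow(nu) recycles down the columns, so row i of the
# matrix is paired with element i of the vector.
z <- (nu - rowMeans(nu)) / apply(nu, 1, sd)
print(round(z, 2))
```

With only four candidates and the sample standard deviation, no z score can exceed 1.5 in magnitude, so a larger pool of candidates is needed before values near two can appear.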

This approach highlights some fairly interesting differences. For instance:
- Carly Fiorina’s use of the word “government” differs by almost two standard deviations from the other candidates.
- “tax” is used significantly more by Republicans than Democrats as is the word “money”.
- Bernie Sanders is the top user of issue words like “health”, “gun”, “economy”, “veteran”, and many others.
- “women” are mentioned by all candidates except Donald Trump.
- “wall” is mentioned significantly more by Donald Trump and Bernie Sanders than by Hillary Clinton or Carly Fiorina.

WORD ASSOCIATIONS FROM N-GRAM TOKENIZATION

Since word frequency alone does not convey context, let’s look at word associations to see if we can clarify intent and context.
To do this, let’s start with bigram tokenization of the text associated with some of the issue key words. Using the {RWeka} package we can create tables of bi- and tri-grams, which can then be searched using standard regular expressions.

bigram_table[grep(word, rownames(bigram_table), ignore.case=TRUE)]
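
For readers without {RWeka}, the same idea can be sketched in base R. This is a swapped-in illustration and not the tokenizer used for the results here. The token sequence is invented, and `bigram_table` is rebuilt locally as a one-dimensional frequency table.

```r
# Toy token sequence (invented, already lower-cased and stripped of punctuation)
tokens <- c("we", "will", "fix", "the", "tax", "code", "and", "cut", "the",
            "tax", "rate", "because", "the", "tax", "code", "is", "broken")

# Pair each token with its successor to form bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Frequency table of bigrams, most frequent first
bigram_table <- sort(table(bigrams), decreasing = TRUE)

# Keep only bigrams whose first word is the key word
tax_first <- bigram_table[grep("^tax ", names(bigram_table))]
print(tax_first)
```

Anchoring the pattern with `^` restricts the match to bigrams that start with the key word; dropping the anchor would also return bigrams such as “the tax”.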

“TAX” IN CONTEXT

The word “tax” is heavily used by Carly Fiorina and Donald Trump. Let’s look at the bigrams starting with the word “tax”.

We can drill down a bit on “tax code” using trigrams. While the number of cases is limited, we get a slight hint of Carly Fiorina’s position here.

The words Donald Trump pairs with “tax” are mostly negative.

“WALL” IN CONTEXT

The word “wall” is used frequently by both Bernie Sanders and Donald Trump. We can clarify the context by looking at bigrams. In this case it’s clear Bernie Sanders is referring exclusively to “wall street” while Donald Trump mostly refers to his proposal to build border walls.

“THEYRE” IN CONTEXT

Donald Trump uses the word “theyre” significantly more than other candidates. The context as revealed by bigrams almost sounds like the script of a zombie movie: “theyre going”, “theyre south”, “theyre feeding”, “theyre coming”. While no position is advocated here, this does begin to hint at a sentiment that, whoever “they” are, they’re after us.

NOTE: After this analysis was completed, the New York Times published a story on the linguistic style of Donald Trump. Its conclusions are similar.

CONCLUSIONS

Word clouds provide insight into differences in style but do not delineate well between candidate positions. Surprisingly, opposing candidates can have very similar word clouds. Looking at the “most frequent” words provides limited insight into differences between candidate positions, since many frequently used words carry no information about them. By looking at the coefficient of variation and selecting for key words, we can highlight striking differences between candidates. Bigrams provide key differences in context and begin to hint at sentiment.

NEXT STEPS

My next step is to expand the text volume by adding more debate text. Since the data suggest candidate speech is largely consistent from debate to debate, it might also be beneficial to include speech transcripts if these can be found easily online.
Another avenue is to use pre-defined word vectors to coax similarities from the texts. This might help narrow the